Predicting Price with Size, Location, and Neighborhood

In this project the goal is to predict apartment prices in Buenos Aires, Argentina, using predictive data science through a linear regression model. A few rules must be followed:

  • The property type must be an apartment
  • The price must be less than 400 000 USD to be considered
In [1]:
import warnings
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

from glob import glob
from category_encoders import OneHotEncoder
from ipywidgets import Dropdown, FloatSlider, IntSlider, interact
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import make_pipeline

warnings.simplefilter(action="ignore", category=FutureWarning)

Preparing and Exploring the Data

Checking the shape, info and head to get a sense of what kind of data I will be working with.

In [2]:
df1 = pd.read_csv("data-1.csv", encoding="ISO-8859-1")
In [3]:
df1.shape
Out[3]:
(8606, 16)
In [4]:
df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8606 entries, 0 to 8605
Data columns (total 16 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   operation                   8606 non-null   object 
 1   property_type               8606 non-null   object 
 2   place_with_parent_names     8606 non-null   object 
 3   lat-lon                     6936 non-null   object 
 4   price                       7590 non-null   float64
 5   currency                    7590 non-null   object 
 6   price_aprox_local_currency  7590 non-null   float64
 7   price_aprox_usd             7590 non-null   float64
 8   surface_total_in_m2         5946 non-null   float64
 9   surface_covered_in_m2       7268 non-null   float64
 10  price_usd_per_m2            4895 non-null   float64
 11  price_per_m2                6520 non-null   float64
 12  floor                       1259 non-null   float64
 13  rooms                       4752 non-null   float64
 14  expenses                    875 non-null    object 
 15  properati_url               8606 non-null   object 
dtypes: float64(9), object(7)
memory usage: 1.1+ MB
In [5]:
df1.head()
Out[5]:
operation property_type place_with_parent_names lat-lon price currency price_aprox_local_currency price_aprox_usd surface_total_in_m2 surface_covered_in_m2 price_usd_per_m2 price_per_m2 floor rooms expenses properati_url
0 sell apartment |Argentina|Capital Federal|Villa Crespo| -34.6047834183,-58.4586812499 180000.0 USD 2729232.0 180000.0 120.0 110.0 1500.000000 1636.363636 NaN 4.0 NaN http://villa-crespo.properati.com.ar/12egq_ven...
1 sell house |Argentina|Bs.As. G.B.A. Zona Oeste|La Matanza... NaN 250000.0 USD 3790600.0 250000.0 117.0 120.0 2136.752137 2083.333333 NaN 4.0 NaN http://ramos-mejia.properati.com.ar/s7pd_venta...
2 sell house |Argentina|Bs.As. G.B.A. Zona Oeste|Morón|Cast... -34.6497002,-58.658073 410000.0 USD 6216584.0 410000.0 410.0 220.0 1000.000000 1863.636364 NaN NaN NaN http://castelar-moron.properati.com.ar/11vgn_v...
3 sell house |Argentina|Bs.As. G.B.A. Zona Oeste|Tres de Fe... -34.5957086,-58.5669503 180000.0 USD 2729232.0 180000.0 200.0 135.0 900.000000 1333.333333 NaN 5.0 NaN http://tres-de-febrero.properati.com.ar/7f7u_v...
4 sell apartment |Argentina|Capital Federal|Chacarita| -34.5846508988,-58.4546932614 129000.0 USD 1955949.6 129000.0 76.0 70.0 1697.368421 1842.857143 NaN NaN NaN http://chacarita.properati.com.ar/10qlv_venta_...

At first glance there are a few problems that need to be taken care of. Since location will be used to predict price, the NaN values in the "lat-lon" column must be treated. For now I will explore the data further.

Seems like there are several files with similar information, so instead of processing them one by one, it would be ideal to build a function that can do everything at once.

In [6]:
def wrangle(filepath):
    
    """
    Read a CSV file into a DataFrame.

    Returns only apartments in Capital Federal priced below 400,000 USD.

    Parameters
    ----------
    filepath : str
        Location of CSV file.
    """
    
    # Import CSV
    df = pd.read_csv(filepath, encoding="ISO-8859-1")

    # Subset to properties in 'Capital Federal'
    mask_ba = df["place_with_parent_names"].str.contains("Capital Federal")

    # Subset to 'apartments'
    mask_apt = df["property_type"] == "apartment"

    # Subset to properties where price is less than 400 000
    mask_price = df["price_aprox_usd"] < 400_000

    # Applying all masks
    df = df[mask_ba & mask_apt & mask_price]

    # Return new Data Frame
    return df
In [7]:
df = wrangle("data-1.csv")
df.shape
Out[7]:
(1781, 16)

Size is usually an important factor in determining the price of an apartment. With that in mind, it is important to look at the distribution of apartment sizes in the dataset.

In [8]:
# Plotting a histogram of apartment sizes
plt.hist(df["surface_covered_in_m2"])

# Detailing title and axis; ';' suppresses any text output
plt.title("Distribution of Apartment Sizes")
plt.xlabel("Area [sq meters]")
plt.ylabel("Frequency");

Seems like the dataset has some extreme outliers. These can hurt model performance, especially for linear models like the one planned here. To confirm, the describe method will come in handy.

In [9]:
df.describe()
Out[9]:
price price_aprox_local_currency price_aprox_usd surface_total_in_m2 surface_covered_in_m2 price_usd_per_m2 price_per_m2 floor rooms
count 1.781000e+03 1.781000e+03 1781.000000 1247.000000 1635.000000 1198.000000 1618.000000 482.000000 1385.000000
mean 2.285410e+05 2.178037e+06 143647.215239 114.417001 97.877064 2231.534726 4340.345113 10.085062 2.434657
std 4.100189e+05 1.141710e+06 75298.758506 1755.263895 1533.057610 650.139587 8126.940810 44.783320 1.065193
min 0.000000e+00 0.000000e+00 0.000000 0.000000 0.000000 0.000000 2.224587 1.000000 1.000000
25% 9.000000e+04 1.349454e+06 89000.000000 40.000000 38.000000 1815.485830 1973.750000 2.000000 2.000000
50% 1.300000e+05 1.849813e+06 122000.000000 55.000000 50.000000 2151.642336 2411.239496 4.000000 2.000000
75% 1.911340e+05 2.653420e+06 175000.000000 79.000000 73.000000 2577.003583 2958.157462 7.000000 3.000000
max 5.487000e+06 6.034635e+06 398000.000000 62034.000000 62034.000000 5975.609756 57156.250000 616.000000 6.000000

It will be easier to read if I format the large numbers without scientific notation:

In [10]:
df.describe().apply(lambda s: s.apply('{0:.5f}'.format))
Out[10]:
price price_aprox_local_currency price_aprox_usd surface_total_in_m2 surface_covered_in_m2 price_usd_per_m2 price_per_m2 floor rooms
count 1781.00000 1781.00000 1781.00000 1247.00000 1635.00000 1198.00000 1618.00000 482.00000 1385.00000
mean 228541.04380 2178036.53577 143647.21524 114.41700 97.87706 2231.53473 4340.34511 10.08506 2.43466
std 410018.91819 1141709.89618 75298.75851 1755.26390 1533.05761 650.13959 8126.94081 44.78332 1.06519
min 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 2.22459 1.00000 1.00000
25% 90000.00000 1349453.60000 89000.00000 40.00000 38.00000 1815.48583 1973.75000 2.00000 2.00000
50% 130000.00000 1849812.80000 122000.00000 55.00000 50.00000 2151.64234 2411.23950 4.00000 2.00000
75% 191134.00000 2653420.00000 175000.00000 79.00000 73.00000 2577.00358 2958.15746 7.00000 3.00000
max 5487000.00000 6034635.20000 398000.00000 62034.00000 62034.00000 5975.60976 57156.25000 616.00000 6.00000

The statistics above confirm it. While most of the apartments in the dataset are smaller than 73 square meters, some are several thousand square meters. The best thing to do is to change the wrangle function and remove them from the dataset. (I will repeat the wrangle function every time I update it to help with visualization and understanding, but in a real project I would write it only once with all the updates.)
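The trimming step about to be added to wrangle can be sketched on a hypothetical series (the values below are made up to mimic surface_covered_in_m2):

```python
import pandas as pd

# Hypothetical apartment sizes, with a low tail and one extreme outlier
area = pd.Series([38, 40, 42, 50, 52, 55, 60, 70, 73, 62034])

# Keep only values between the 10th and 90th percentiles
low, high = area.quantile([0.1, 0.9])
trimmed = area[area.between(low, high)]
print(trimmed.tolist())  # → [40, 42, 50, 52, 55, 60, 70, 73]
```

Both tails are cut, so the 62,034 m2 outlier disappears along with the bottom 10% of sizes.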

In [11]:
def wrangle(filepath):
    
    """
    Read a CSV file into a DataFrame.

    Returns only apartments in Capital Federal priced below 400,000 USD.

    Parameters
    ----------
    filepath : str
        Location of CSV file.
    """
    
    # Import CSV
    df = pd.read_csv(filepath, encoding="ISO-8859-1")

    # Subset to properties in 'Capital Federal'
    mask_ba = df["place_with_parent_names"].str.contains("Capital Federal")

    # Subset to 'apartments'
    mask_apt = df["property_type"] == "apartment"

    # Subset to properties where price is less than 400 000
    mask_price = df["price_aprox_usd"] < 400_000

    # Applying all masks
    df = df[mask_ba & mask_apt & mask_price]
    
    # Remove outliers by "surface_covered_in_m2"
    low, high = df["surface_covered_in_m2"].quantile([0.1, 0.9])
    mask_area = df["surface_covered_in_m2"].between(low, high)
    df = df[mask_area]

    # Return new Data Frame
    return df
In [12]:
df = wrangle("data-1.csv")
df.shape
Out[12]:
(1343, 16)
In [13]:
# Plotting a histogram of apartment sizes
plt.hist(df["surface_covered_in_m2"])

# Detailing title and axis; ';' suppresses any text output
plt.title("Distribution of Apartment Sizes")
plt.xlabel("Area [sq meters]")
plt.ylabel("Frequency");

Without the massive outliers, the histogram looks much better.

Now I will check whether there is a relationship between apartment size and price. A scatter plot is appropriate for that.

In [14]:
# Plotting a scatter plot
plt.scatter(df["surface_covered_in_m2"], df["price_aprox_usd"])

# Detailing title and axis; ';' suppresses any text output
plt.title("Area vs. Price")
plt.xlabel("Area [sq meters]")
plt.ylabel("Price [USD]")
Out[14]:
Text(0, 0.5, 'Price [USD]')

This plot suggests that there is a moderate positive correlation between apartment price and size. This means that if the price is the goal of the prediction, size will be a good feature to include.
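To put a number on that visual impression, pandas can compute the Pearson correlation directly. The area/price pairs below are hypothetical, not rows from the dataset:

```python
import pandas as pd

# Hypothetical area/price pairs; larger apartments tend to cost more
toy = pd.DataFrame({
    "surface_covered_in_m2": [38, 42, 50, 54, 70, 110],
    "price_aprox_usd": [87_000, 57_000, 90_000, 118_000, 129_000, 180_000],
})

# Pearson correlation: +1 is a perfect positive linear relationship,
# 0 is none, -1 is a perfect negative one
r = toy["surface_covered_in_m2"].corr(toy["price_aprox_usd"])
print(round(r, 2))
```

A value well above zero supports keeping size as a feature in the model.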

Since apartment location, specifically latitude and longitude, will be considered for the final prediction, it will be necessary to separate them; the info output also shows that the "lat-lon" data type is object. To build a model, latitude and longitude each need their own column with a float data type.
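A toy sketch of that conversion, using the coordinate string from row 0 of the head above (the None row stands in for a missing "lat-lon" value):

```python
import pandas as pd

# Toy "lat-lon" strings in the same format as the dataset
toy = pd.DataFrame({"lat-lon": ["-34.6047834183,-58.4586812499", None]})

# Split on the comma into two columns, then cast from object to float;
# the missing value propagates as NaN in both new columns
toy[["lat", "lon"]] = toy["lat-lon"].str.split(",", expand=True).astype(float)
print(toy[["lat", "lon"]].dtypes)
```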

In [15]:
def wrangle(filepath):
    
    """
    Read a CSV file into a DataFrame.

    Returns only apartments in Capital Federal priced below 400,000 USD.

    Parameters
    ----------
    filepath : str
        Location of CSV file.
    """
    
    # Import CSV
    df = pd.read_csv(filepath, encoding="ISO-8859-1")

    # Subset to properties in 'Capital Federal'
    mask_ba = df["place_with_parent_names"].str.contains("Capital Federal")

    # Subset to 'apartments'
    mask_apt = df["property_type"] == "apartment"

    # Subset to properties where price is less than 400 000
    mask_price = df["price_aprox_usd"] < 400_000

    # Applying all masks
    df = df[mask_ba & mask_apt & mask_price]
    
    # Remove outliers by "surface_covered_in_m2"
    low, high = df["surface_covered_in_m2"].quantile([0.1, 0.9])
    mask_area = df["surface_covered_in_m2"].between(low, high)
    df = df[mask_area]
    
    # Split the "lat-lon" col
    df[["lat", "lon"]] = df["lat-lon"].str.split(",", expand=True).astype(float)
    df.drop(columns="lat-lon", inplace=True)

    # Return new Data Frame
    return df
In [16]:
df = wrangle("data-1.csv")
df.head()
Out[16]:
operation property_type place_with_parent_names price currency price_aprox_local_currency price_aprox_usd surface_total_in_m2 surface_covered_in_m2 price_usd_per_m2 price_per_m2 floor rooms expenses properati_url lat lon
4 sell apartment |Argentina|Capital Federal|Chacarita| 129000.0 USD 1955949.6 129000.0 76.0 70.0 1697.368421 1842.857143 NaN NaN NaN http://chacarita.properati.com.ar/10qlv_venta_... -34.584651 -58.454693
9 sell apartment |Argentina|Capital Federal|Villa Luro| 87000.0 USD 1319128.8 87000.0 48.0 42.0 1812.500000 2071.428571 NaN NaN NaN http://villa-luro.properati.com.ar/12m82_venta... -34.638979 -58.500115
29 sell apartment |Argentina|Capital Federal|Caballito| 118000.0 USD 1789163.2 118000.0 NaN 54.0 NaN 2185.185185 NaN 2.0 NaN http://caballito.properati.com.ar/11wqh_venta_... -34.615847 -58.459957
40 sell apartment |Argentina|Capital Federal|Constitución| 57000.0 USD 864256.8 57000.0 42.0 42.0 1357.142857 1357.142857 5.0 2.0 364 http://constitucion.properati.com.ar/k2f0_vent... -34.625222 -58.382382
41 sell apartment |Argentina|Capital Federal|Once| 90000.0 USD 1364616.0 90000.0 57.0 50.0 1578.947368 1800.000000 NaN 3.0 450 http://once.properati.com.ar/suwa_venta_depart... -34.610610 -58.412511

I will plot a map using plotly to visualize how well the DataFrame depicts the apartments and their prices.

In [17]:
fig = px.scatter_mapbox(
    df,  # DataFrame
    lat='lat',
    lon='lon',
    width=600,  # Width of map
    height=600,  # Height of map
    color='price_aprox_usd',
    hover_data=["price_aprox_usd"],  # Display price when hovering over an apartment
)

fig.update_layout(mapbox_style="open-street-map")

fig.show()

Seems OK, but there are probably more properties. Using glob, I will concatenate all matching files in the folder:

In [18]:
files = glob("data-*.csv")
files
Out[18]:
['data\\data-1.csv',
 'data\\data-2.csv',
 'data\\data-3.csv',
 'data\\data-4.csv',
 'data\\data-5.csv']
In [19]:
frames = []
for file in files:
    frames.append(wrangle(file))
len(frames)
Out[19]:
5
In [20]:
df = pd.concat(frames)
df.head()
Out[20]:
operation property_type place_with_parent_names price currency price_aprox_local_currency price_aprox_usd surface_total_in_m2 surface_covered_in_m2 price_usd_per_m2 price_per_m2 floor rooms expenses properati_url lat lon
4 sell apartment |Argentina|Capital Federal|Chacarita| 129000.0 USD 1955949.6 129000.0 76.0 70.0 1697.368421 1842.857143 NaN NaN NaN http://chacarita.properati.com.ar/10qlv_venta_... -34.584651 -58.454693
9 sell apartment |Argentina|Capital Federal|Villa Luro| 87000.0 USD 1319128.8 87000.0 48.0 42.0 1812.500000 2071.428571 NaN NaN NaN http://villa-luro.properati.com.ar/12m82_venta... -34.638979 -58.500115
29 sell apartment |Argentina|Capital Federal|Caballito| 118000.0 USD 1789163.2 118000.0 NaN 54.0 NaN 2185.185185 NaN 2.0 NaN http://caballito.properati.com.ar/11wqh_venta_... -34.615847 -58.459957
40 sell apartment |Argentina|Capital Federal|Constitución| 57000.0 USD 864256.8 57000.0 42.0 42.0 1357.142857 1357.142857 5.0 2.0 364 http://constitucion.properati.com.ar/k2f0_vent... -34.625222 -58.382382
41 sell apartment |Argentina|Capital Federal|Once| 90000.0 USD 1364616.0 90000.0 57.0 50.0 1578.947368 1800.000000 NaN 3.0 450 http://once.properati.com.ar/suwa_venta_depart... -34.610610 -58.412511
In [21]:
fig = px.scatter_mapbox(
    df,  # DataFrame
    lat='lat',
    lon='lon',
    width=600,  # Width of map
    height=600,  # Height of map
    color='price_aprox_usd',
    hover_data=["price_aprox_usd"],  # Display price when hovering over an apartment
)

fig.update_layout(mapbox_style="open-street-map")

fig.show()

Just by looking at the map it is already noticeable that the closer an apartment is to the coast, the more expensive it gets. That is also a reminder that neighborhoods usually play an important role in an apartment's price. I will investigate whether that assumption is correct. In the available data the column place_with_parent_names holds three pieces of information: Country | City | Neighborhood. Since the latter is important, I will create a column with it.
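A quick look at why index 3 will hold the neighborhood: the leading and trailing pipes in place_with_parent_names produce empty strings at both ends of the split.

```python
import pandas as pd

# One value in the same format as "place_with_parent_names"
place = pd.Series(["|Argentina|Capital Federal|Chacarita|"])

# Splitting on "|" yields 5 columns: '', country, city, neighborhood, ''
parts = place.str.split("|", expand=True)
print(parts[3].iloc[0])  # → Chacarita
```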

In [22]:
def wrangle(filepath):
    
    """
    Read a CSV file into a DataFrame.

    Returns only apartments in Capital Federal priced below 400,000 USD.

    Parameters
    ----------
    filepath : str
        Location of CSV file.
    """
    
    # Import CSV
    df = pd.read_csv(filepath, encoding="ISO-8859-1")

    # Subset to properties in 'Capital Federal'
    mask_ba = df["place_with_parent_names"].str.contains("Capital Federal")

    # Subset to 'apartments'
    mask_apt = df["property_type"] == "apartment"

    # Subset to properties where price is less than 400 000
    mask_price = df["price_aprox_usd"] < 400_000

    # Applying all masks
    df = df[mask_ba & mask_apt & mask_price]
    
    # Remove outliers by "surface_covered_in_m2"
    low, high = df["surface_covered_in_m2"].quantile([0.1, 0.9])
    mask_area = df["surface_covered_in_m2"].between(low, high)
    df = df[mask_area]
    
    # Split the "lat-lon" col
    df[["lat", "lon"]] = df["lat-lon"].str.split(",", expand=True).astype(float)
    df.drop(columns="lat-lon", inplace=True)
    
    # Creating a neighborhood column
    df["neighborhood"] = df["place_with_parent_names"].str.split("|", expand=True)[3]

    # Return new Data Frame
    return df
In [23]:
df = wrangle("data-1.csv")
df.head()
Out[23]:
operation property_type place_with_parent_names price currency price_aprox_local_currency price_aprox_usd surface_total_in_m2 surface_covered_in_m2 price_usd_per_m2 price_per_m2 floor rooms expenses properati_url lat lon neighborhood
4 sell apartment |Argentina|Capital Federal|Chacarita| 129000.0 USD 1955949.6 129000.0 76.0 70.0 1697.368421 1842.857143 NaN NaN NaN http://chacarita.properati.com.ar/10qlv_venta_... -34.584651 -58.454693 Chacarita
9 sell apartment |Argentina|Capital Federal|Villa Luro| 87000.0 USD 1319128.8 87000.0 48.0 42.0 1812.500000 2071.428571 NaN NaN NaN http://villa-luro.properati.com.ar/12m82_venta... -34.638979 -58.500115 Villa Luro
29 sell apartment |Argentina|Capital Federal|Caballito| 118000.0 USD 1789163.2 118000.0 NaN 54.0 NaN 2185.185185 NaN 2.0 NaN http://caballito.properati.com.ar/11wqh_venta_... -34.615847 -58.459957 Caballito
40 sell apartment |Argentina|Capital Federal|Constitución| 57000.0 USD 864256.8 57000.0 42.0 42.0 1357.142857 1357.142857 5.0 2.0 364 http://constitucion.properati.com.ar/k2f0_vent... -34.625222 -58.382382 Constitución
41 sell apartment |Argentina|Capital Federal|Once| 90000.0 USD 1364616.0 90000.0 57.0 50.0 1578.947368 1800.000000 NaN 3.0 450 http://once.properati.com.ar/suwa_venta_depart... -34.610610 -58.412511 Once
In [24]:
df.groupby("neighborhood")[["price", "surface_covered_in_m2"]].mean().sort_values(by="price", ascending=False)
Out[24]:
price surface_covered_in_m2
neighborhood
Parque Chacabuco 597018.833333 69.000000
Barracas 491165.064516 54.387097
Boedo 453569.750000 50.250000
Saavedra 331669.558824 51.970588
San Cristobal 331016.729730 51.351351
Colegiales 317316.571429 57.380952
Belgrano 314852.141732 57.543307
San Telmo 298507.484848 54.757576
Nuñez 296303.395833 55.583333
Flores 287899.000000 57.500000
Centro / Microcentro 285826.100000 49.500000
Caballito 274527.260000 50.110000
Puerto Madero 263619.545455 56.090909
Constitución 248722.222222 51.777778
Las Cañitas 228626.875000 62.750000
Parque Centenario 217940.000000 48.700000
Recoleta 208590.800000 68.766667
Congreso 208125.000000 57.375000
Coghlan 194600.600000 40.300000
Barrio Norte 183231.763158 64.052632
Villa Luro 182166.875000 59.250000
Palermo 179831.396825 53.293651
Villa Urquiza 172493.882353 52.397059
Almagro 160323.044444 54.000000
Retiro 155400.000000 58.833333
Monserrat 143211.538462 49.384615
Balvanera 141458.622222 55.488889
Villa Crespo 140503.931818 54.000000
Chacarita 138555.555556 57.222222
Abasto 136000.000000 64.000000
Monte Castro 129316.666667 65.000000
Villa General Mitre 124000.000000 69.333333
Villa Devoto 120536.411765 53.882353
Tribunales 120000.000000 47.000000
Villa Pueyrredón 118163.333333 56.666667
Floresta 117876.470588 65.058824
Versalles 116000.000000 62.000000
Once 115275.000000 65.000000
Villa del Parque 113889.866667 54.466667
Paternal 113065.500000 50.812500
Parque Patricios 113033.333333 53.500000
San Nicolás 111692.307692 57.153846
Liniers 110714.285714 62.571429
Villa Ortuzar 104000.000000 44.500000
Parque Chas 103722.333333 45.333333
102375.000000 45.812500
Agronomía 101250.000000 39.250000
Mataderos 99613.888889 54.888889
Boca 98666.666667 59.200000
Velez Sarsfield 92000.000000 52.000000
Parque Avellaneda 82400.000000 49.200000
Villa Santa Rita 81683.333333 43.000000
Villa Lugano 75533.333333 63.833333

Even though the average sizes are similar, the average prices are not; it is clear that some neighborhoods are more prestigious than others.

Beyond these more obvious assumptions, there are more features in the DataFrame whose value can be tricky to judge. So for now, I will keep all of them and continue inspecting the data.
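Since neighborhood is a text column, it will eventually need to be encoded as numbers before the regression can use it. A minimal sketch with hypothetical names, using pandas' get_dummies (the imported category_encoders OneHotEncoder does the equivalent step inside a pipeline):

```python
import pandas as pd

# Hypothetical neighborhood values; one-hot encoding turns each
# category into its own 0/1 indicator column
toy = pd.DataFrame({"neighborhood": ["Palermo", "Once", "Palermo"]})
encoded = pd.get_dummies(toy, columns=["neighborhood"])
print(encoded)
```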

In [25]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1343 entries, 4 to 8604
Data columns (total 18 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   operation                   1343 non-null   object 
 1   property_type               1343 non-null   object 
 2   place_with_parent_names     1343 non-null   object 
 3   price                       1343 non-null   float64
 4   currency                    1343 non-null   object 
 5   price_aprox_local_currency  1343 non-null   float64
 6   price_aprox_usd             1343 non-null   float64
 7   surface_total_in_m2         965 non-null    float64
 8   surface_covered_in_m2       1343 non-null   float64
 9   price_usd_per_m2            927 non-null    float64
 10  price_per_m2                1343 non-null   float64
 11  floor                       379 non-null    float64
 12  rooms                       1078 non-null   float64
 13  expenses                    349 non-null    object 
 14  properati_url               1343 non-null   object 
 15  lat                         1300 non-null   float64
 16  lon                         1300 non-null   float64
 17  neighborhood                1343 non-null   object 
dtypes: float64(11), object(7)
memory usage: 199.4+ KB

The more information I have, the better: I can impute missing values. However, there still needs to be enough data in a column for a good imputation. So if more than half of the data in a column is missing, I will drop the column and impute the rest.

In [26]:
df.isnull().sum() / len(df)
Out[26]:
operation                     0.000000
property_type                 0.000000
place_with_parent_names       0.000000
price                         0.000000
currency                      0.000000
price_aprox_local_currency    0.000000
price_aprox_usd               0.000000
surface_total_in_m2           0.281459
surface_covered_in_m2         0.000000
price_usd_per_m2              0.309754
price_per_m2                  0.000000
floor                         0.717796
rooms                         0.197319
expenses                      0.740134
properati_url                 0.000000
lat                           0.032018
lon                           0.032018
neighborhood                  0.000000
dtype: float64

Seems like floor and expenses have a lot of NaN values (over 70% each), so they definitely need to go.
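For the columns that stay, mean imputation is what the imported SimpleImputer applies with its default strategy; a sketch on a toy column:

```python
import pandas as pd

# Toy "rooms"-like column with a few missing values
rooms = pd.Series([2.0, None, 3.0, 2.0, None, 1.0])

# Fill missing entries with the mean of the observed values (2.0 here),
# mirroring SimpleImputer(strategy="mean")
filled = rooms.fillna(rooms.mean())
print(filled.tolist())  # → [2.0, 2.0, 3.0, 2.0, 2.0, 1.0]
```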

In [27]:
def wrangle(filepath):
    
    """
    Read a CSV file into a DataFrame.

    Returns only apartments in Capital Federal priced below 400,000 USD.

    Parameters
    ----------
    filepath : str
        Location of CSV file.
    """
    
    # Import CSV
    df = pd.read_csv(filepath, encoding="ISO-8859-1")

    # Subset to properties in 'Capital Federal'
    mask_ba = df["place_with_parent_names"].str.contains("Capital Federal")

    # Subset to 'apartments'
    mask_apt = df["property_type"] == "apartment"

    # Subset to properties where price is less than 400 000
    mask_price = df["price_aprox_usd"] < 400_000

    # Applying all masks
    df = df[mask_ba & mask_apt & mask_price]
    
    # Remove outliers by "surface_covered_in_m2"
    low, high = df["surface_covered_in_m2"].quantile([0.1, 0.9])
    mask_area = df["surface_covered_in_m2"].between(low, high)
    df = df[mask_area]
    
    # Split the "lat-lon" col
    df[["lat", "lon"]] = df["lat-lon"].str.split(",", expand=True).astype(float)
    df.drop(columns="lat-lon", inplace=True)
    
    # Creating a neighborhood column
    df["neighborhood"] = df["place_with_parent_names"].str.split("|", expand=True)[3]
    
    # Drop columns with more than half null values
    df.drop(columns=["floor", "expenses"], inplace=True)

    # Return new Data Frame
    return df
In [28]:
df = wrangle("data-1.csv")
df.head()
Out[28]:
operation property_type place_with_parent_names price currency price_aprox_local_currency price_aprox_usd surface_total_in_m2 surface_covered_in_m2 price_usd_per_m2 price_per_m2 rooms properati_url lat lon neighborhood
4 sell apartment |Argentina|Capital Federal|Chacarita| 129000.0 USD 1955949.6 129000.0 76.0 70.0 1697.368421 1842.857143 NaN http://chacarita.properati.com.ar/10qlv_venta_... -34.584651 -58.454693 Chacarita
9 sell apartment |Argentina|Capital Federal|Villa Luro| 87000.0 USD 1319128.8 87000.0 48.0 42.0 1812.500000 2071.428571 NaN http://villa-luro.properati.com.ar/12m82_venta... -34.638979 -58.500115 Villa Luro
29 sell apartment |Argentina|Capital Federal|Caballito| 118000.0 USD 1789163.2 118000.0 NaN 54.0 NaN 2185.185185 2.0 http://caballito.properati.com.ar/11wqh_venta_... -34.615847 -58.459957 Caballito
40 sell apartment |Argentina|Capital Federal|Constitución| 57000.0 USD 864256.8 57000.0 42.0 42.0 1357.142857 1357.142857 2.0 http://constitucion.properati.com.ar/k2f0_vent... -34.625222 -58.382382 Constitución
41 sell apartment |Argentina|Capital Federal|Once| 90000.0 USD 1364616.0 90000.0 57.0 50.0 1578.947368 1800.000000 3.0 http://once.properati.com.ar/suwa_venta_depart... -34.610610 -58.412511 Once

My next step will be removing low- and high-cardinality categorical columns. If there is only one category in a column, it will not provide any unique information to the model. At the other extreme, columns where nearly every row has its own category will not help the model identify useful trends in the data.

In [29]:
# Selecting all columns with object dtype, i.e., candidate categorical columns
df.select_dtypes("object").head()
Out[29]:
operation property_type place_with_parent_names currency properati_url neighborhood
4 sell apartment |Argentina|Capital Federal|Chacarita| USD http://chacarita.properati.com.ar/10qlv_venta_... Chacarita
9 sell apartment |Argentina|Capital Federal|Villa Luro| USD http://villa-luro.properati.com.ar/12m82_venta... Villa Luro
29 sell apartment |Argentina|Capital Federal|Caballito| USD http://caballito.properati.com.ar/11wqh_venta_... Caballito
40 sell apartment |Argentina|Capital Federal|Constitución| USD http://constitucion.properati.com.ar/k2f0_vent... Constitución
41 sell apartment |Argentina|Capital Federal|Once| USD http://once.properati.com.ar/suwa_venta_depart... Once
In [30]:
# Counting how many unique objects there are in those columns
df.select_dtypes("object").nunique()
Out[30]:
operation                     1
property_type                 1
place_with_parent_names      53
currency                      2
properati_url              1343
neighborhood                 53
dtype: int64
In [31]:
len(df)
Out[31]:
1343

Looking at those values it is easy to see that operation, property_type, and currency have only one or two unique values. On the other hand, properati_url has too many: a different value for every row in the DataFrame. Updating the wrangle function is the best approach.

In [32]:
def wrangle(filepath):
    
    """
    Read a CSV file into a DataFrame.

    Returns only apartments in Capital Federal priced below 400,000 USD.

    Parameters
    ----------
    filepath : str
        Location of CSV file.
    """
    
    # Import CSV
    df = pd.read_csv(filepath, encoding="ISO-8859-1")

    # Subset to properties in 'Capital Federal'
    mask_ba = df["place_with_parent_names"].str.contains("Capital Federal")

    # Subset to 'apartments'
    mask_apt = df["property_type"] == "apartment"

    # Subset to properties where price is less than 400 000
    mask_price = df["price_aprox_usd"] < 400_000

    # Applying all masks
    df = df[mask_ba & mask_apt & mask_price]
    
    # Remove outliers by "surface_covered_in_m2"
    low, high = df["surface_covered_in_m2"].quantile([0.1, 0.9])
    mask_area = df["surface_covered_in_m2"].between(low, high)
    df = df[mask_area]
    
    # Split the "lat-lon" col
    df[["lat", "lon"]] = df["lat-lon"].str.split(",", expand=True).astype(float)
    df.drop(columns="lat-lon", inplace=True)
    
    # Creating a neighborhood column
    df["neighborhood"] = df["place_with_parent_names"].str.split("|", expand=True)[3]
    
    # Drop columns with more than half null values
    df.drop(columns=["floor", "expenses"], inplace=True)
    
    # Drop low and high cardinality categorical variables
    df.drop(columns=["operation", "property_type", "currency", "properati_url"], inplace=True)

    # Return new Data Frame
    return df
In [33]:
df = wrangle("data-1.csv")
df.head()
Out[33]:
place_with_parent_names price price_aprox_local_currency price_aprox_usd surface_total_in_m2 surface_covered_in_m2 price_usd_per_m2 price_per_m2 rooms lat lon neighborhood
4 |Argentina|Capital Federal|Chacarita| 129000.0 1955949.6 129000.0 76.0 70.0 1697.368421 1842.857143 NaN -34.584651 -58.454693 Chacarita
9 |Argentina|Capital Federal|Villa Luro| 87000.0 1319128.8 87000.0 48.0 42.0 1812.500000 2071.428571 NaN -34.638979 -58.500115 Villa Luro
29 |Argentina|Capital Federal|Caballito| 118000.0 1789163.2 118000.0 NaN 54.0 NaN 2185.185185 2.0 -34.615847 -58.459957 Caballito
40 |Argentina|Capital Federal|Constitución| 57000.0 864256.8 57000.0 42.0 42.0 1357.142857 1357.142857 2.0 -34.625222 -58.382382 Constitución
41 |Argentina|Capital Federal|Once| 90000.0 1364616.0 90000.0 57.0 50.0 1578.947368 1800.000000 3.0 -34.610610 -58.412511 Once

Another important cleaning step is to drop any column that would constitute leakage, that is, features that were created using the target or that would give the model information it won't have access to when it's deployed.

In [34]:
df.columns.to_list()
Out[34]:
['place_with_parent_names',
 'price',
 'price_aprox_local_currency',
 'price_aprox_usd',
 'surface_total_in_m2',
 'surface_covered_in_m2',
 'price_usd_per_m2',
 'price_per_m2',
 'rooms',
 'lat',
 'lon',
 'neighborhood']

Analyzing the columns, it is noticeable that several of them constitute leakage: since I am trying to build a model to predict sale prices, I cannot have them before a sale happens. So price, price_aprox_local_currency, price_usd_per_m2, and price_per_m2 will be dropped. price_aprox_usd will not be dropped because it will be the target, and the model needs it for training.
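A quick sanity check of the leakage claim, using the values from row 4 of the head above: price_usd_per_m2 is just the target divided by the surface, so it encodes the answer directly.

```python
# Values taken from row 4 of the DataFrame head shown earlier
price_aprox_usd = 129000.0
surface_total_in_m2 = 76.0

# Reconstructing price_usd_per_m2 from the target reproduces the column
print(round(price_aprox_usd / surface_total_in_m2, 6))  # → 1697.368421
```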

In [35]:
def wrangle(filepath):
    
    """
    Read a CSV file into a DataFrame.

    Returns only apartments in Capital Federal priced below 400,000 USD.

    Parameters
    ----------
    filepath : str
        Location of CSV file.
    """
    
    # Import CSV
    df = pd.read_csv(filepath, encoding="ISO-8859-1")

    # Subset to properties in 'Capital Federal'
    mask_ba = df["place_with_parent_names"].str.contains("Capital Federal")

    # Subset to 'apartments'
    mask_apt = df["property_type"] == "apartment"

    # Subset to properties where price is less than 400 000
    mask_price = df["price_aprox_usd"] < 400_000

    # Applying all masks
    df = df[mask_ba & mask_apt & mask_price]
    
    # Remove outliers by "surface_covered_in_m2"
    low, high = df["surface_covered_in_m2"].quantile([0.1, 0.9])
    mask_area = df["surface_covered_in_m2"].between(low, high)
    df = df[mask_area]
    
    # Split the "lat-lon" col
    df[["lat", "lon"]] = df["lat-lon"].str.split(",", expand=True).astype(float)
    df.drop(columns="lat-lon", inplace=True)
    
    # Creating a neighborhood column
    df["neighborhood"] = df["place_with_parent_names"].str.split("|", expand=True)[3]
    
    # Drop columns with more than half null values
    df.drop(columns=["floor", "expenses"], inplace=True)
    
    # Drop low and high cardinality categorical variables
    df.drop(columns=["operation", "property_type", "currency", "properati_url"], inplace=True)
    
    # Drop leakage 
    df.drop(columns=['price', 'price_aprox_local_currency', 'price_per_m2', 'price_usd_per_m2'], inplace=True)

    # Return new Data Frame
    return df
In [36]:
df = wrangle("data-1.csv")
df.head()
Out[36]:
place_with_parent_names price_aprox_usd surface_total_in_m2 surface_covered_in_m2 rooms lat lon neighborhood
4 |Argentina|Capital Federal|Chacarita| 129000.0 76.0 70.0 NaN -34.584651 -58.454693 Chacarita
9 |Argentina|Capital Federal|Villa Luro| 87000.0 48.0 42.0 NaN -34.638979 -58.500115 Villa Luro
29 |Argentina|Capital Federal|Caballito| 118000.0 NaN 54.0 2.0 -34.615847 -58.459957 Caballito
40 |Argentina|Capital Federal|Constitución| 57000.0 42.0 42.0 2.0 -34.625222 -58.382382 Constitución
41 |Argentina|Capital Federal|Once| 90000.0 57.0 50.0 3.0 -34.610610 -58.412511 Once

Finally, the last issue to keep an eye out for is multicollinearity, that is, features in the feature matrix that are highly correlated with each other. A good way to detect this is to use a heatmap.

In [37]:
# Get the correlation between all number type columns
corr = df.select_dtypes("number").drop(columns="price_aprox_usd").corr()

# Plot the heatmap
sns.heatmap(corr);
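As a complement to eyeballing the heatmap, highly correlated pairs can also be listed programmatically. A sketch on synthetic columns (names chosen to mirror the dataset, values made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
area = rng.uniform(30, 100, 200)

# Synthetic stand-ins: total surface is roughly covered surface plus noise
df = pd.DataFrame({
    "surface_covered_in_m2": area,
    "surface_total_in_m2": area * 1.1 + rng.normal(0, 2, 200),
    "lat": rng.uniform(-34.7, -34.5, 200),
})

# Keep only the upper triangle so each pair appears once, then filter
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
print(upper.stack()[upper.stack() > 0.8])
```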

It seems that three features are highly correlated with each other: rooms, surface_covered_in_m2, and surface_total_in_m2. Ideally only one of them should be kept. To decide which one is best to keep, I will look at the data info.

In [38]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1343 entries, 4 to 8604
Data columns (total 8 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   place_with_parent_names  1343 non-null   object 
 1   price_aprox_usd          1343 non-null   float64
 2   surface_total_in_m2      965 non-null    float64
 3   surface_covered_in_m2    1343 non-null   float64
 4   rooms                    1078 non-null   float64
 5   lat                      1300 non-null   float64
 6   lon                      1300 non-null   float64
 7   neighborhood             1343 non-null   object 
dtypes: float64(6), object(2)
memory usage: 94.4+ KB

The best of the three is surface_covered_in_m2 because it has no NaN values.
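That decision rule can be sketched as a simple null-count comparison (synthetic values standing in for the three correlated columns):

```python
import numpy as np
import pandas as pd

# Made-up values mimicking the missingness pattern seen in df.info()
df = pd.DataFrame({
    "surface_total_in_m2": [60.0, np.nan, 80.0, np.nan],
    "surface_covered_in_m2": [55.0, 48.0, 72.0, 65.0],
    "rooms": [2.0, np.nan, 3.0, 2.0],
})

# The column with the fewest missing values is the one worth keeping
null_counts = df.isnull().sum().sort_values()
print(null_counts.index[0])  # surface_covered_in_m2
```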

In [39]:
def wrangle(filepath):
    
    """
    Read CSV files into a 'DataFrame'.

    Returns only property apartments in Capital Federal which the price is less than 400 000 USD.

    Parameters
    ----------
    filepath : str
        Location of CSV file.
    """
    
    # Import CSV
    df = pd.read_csv(filepath, encoding="ISO-8859-1")

    # Subset to properties in 'Capital Federal'
    mask_ba = df["place_with_parent_names"].str.contains("Capital Federal")

    # Subset to 'apartments'
    mask_apt = df["property_type"] == "apartment"

    # Subset to properties where price is less than 400 000
    mask_price = df["price_aprox_usd"] < 400_000

    # Applying all masks
    df = df[mask_ba & mask_apt & mask_price]
    
    # Remove outliers by "surface_covered_in_m2"
    low, high = df["surface_covered_in_m2"].quantile([0.1, 0.9])
    mask_area = df["surface_covered_in_m2"].between(low, high)
    df = df[mask_area].copy()  # copy to avoid SettingWithCopyWarning on later assignments
    
    # Split the "lat-lon" col
    df[["lat", "lon"]] = df["lat-lon"].str.split(",", expand=True).astype(float)
    df.drop(columns="lat-lon", inplace=True)
    
    # Create a neighborhood column
    df["neighborhood"] = df["place_with_parent_names"].str.split("|", expand=True)[3]
    
    # Drop columns with more than half null values
    df.drop(columns=["floor", "expenses"], inplace=True)
    
    # Drop low and high cardinality categorical variables
    df.drop(columns=["operation", "property_type", "currency", "properati_url"], inplace=True)
    
    # Drop leakage 
    df.drop(columns=['price', 'price_aprox_local_currency', 'price_per_m2', 'price_usd_per_m2'], inplace=True)
    
    # Drop columns causing multicollinearity
    df.drop(columns=["surface_total_in_m2", "rooms"], inplace=True)

    # Return new Data Frame
    return df
In [40]:
# Gather every data CSV file (glob pattern assumed from "data-1.csv")
files = glob("data-*.csv")

frames = []
for file in files:
    frames.append(wrangle(file))
df = pd.concat(frames)
df.head()
Out[40]:
place_with_parent_names price_aprox_usd surface_covered_in_m2 lat lon neighborhood
4 |Argentina|Capital Federal|Chacarita| 129000.0 70.0 -34.584651 -58.454693 Chacarita
9 |Argentina|Capital Federal|Villa Luro| 87000.0 42.0 -34.638979 -58.500115 Villa Luro
29 |Argentina|Capital Federal|Caballito| 118000.0 54.0 -34.615847 -58.459957 Caballito
40 |Argentina|Capital Federal|Constitución| 57000.0 42.0 -34.625222 -58.382382 Constitución
41 |Argentina|Capital Federal|Once| 90000.0 50.0 -34.610610 -58.412511 Once
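The glob-plus-concat pattern used above can be sketched on two throwaway CSVs (the temporary directory and filenames here are made up for the example):

```python
import os
import tempfile
from glob import glob

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Write two tiny CSVs to stand in for the data files
    pd.DataFrame({"a": [1, 2]}).to_csv(os.path.join(tmp, "data-1.csv"), index=False)
    pd.DataFrame({"a": [3]}).to_csv(os.path.join(tmp, "data-2.csv"), index=False)

    # Read every matching file and stack the results into one DataFrame
    frames = [pd.read_csv(f) for f in sorted(glob(os.path.join(tmp, "data-*.csv")))]
    df = pd.concat(frames, ignore_index=True)

print(len(df))  # 3 rows in total
```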

Build Model¶

Now it is time to build the model. First, I will split the data into a feature matrix and a target vector.

In [41]:
# The target is the aproximate price
target = "price_aprox_usd"

# The feature matrix is the rest of the DataFrame
X_train = df[['surface_covered_in_m2', 'lat', 'lon', 'neighborhood']]
y_train = df[target]

Baseline¶

To have a better understanding of how well the model is performing I will be comparing it to the baseline mean absolute error

In [42]:
# The overall mean will be the baseline
y_mean = y_train.mean()

# One value for each row
y_pred_baseline = [y_mean] * len(y_train)
print("Mean apt price:", round(y_mean, 2))

print("Baseline MAE:", round(mean_absolute_error(y_train, y_pred_baseline), 2))
Mean apt price: 132383.84
Baseline MAE: 44860.11

The average apartment price is around 132k and the baseline MAE is around 45k. That is, if the overall mean were used to predict the value of every apartment, it would miss the price by around 45k on average.
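The baseline MAE is just the average absolute distance from the mean. A quick sketch on toy prices:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Toy prices (made-up numbers, not the real data)
y = np.array([100_000.0, 120_000.0, 180_000.0])

# Predict the mean for every observation
y_pred_baseline = np.full_like(y, y.mean())

print(round(mean_absolute_error(y, y_pred_baseline), 2))  # 31111.11
```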

For my model I will be using OneHotEncoder because of the categorical neighborhood column and SimpleImputer for the missing values. Before building the model it is worth seeing how many unique neighborhoods there are.

In [43]:
df.neighborhood.nunique()
Out[43]:
57
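With 57 neighborhoods, one-hot encoding turns the single categorical column into 57 binary columns, which (together with the 3 numeric features) accounts for the 60 coefficients the fitted model ends up with. A minimal sketch of the idea using pandas' get_dummies (the pipeline itself uses category_encoders' OneHotEncoder; neighborhood values here are just two sample names):

```python
import pandas as pd

# One-hot encoding: one binary column per category
X = pd.DataFrame({"neighborhood": ["Palermo", "Chacarita", "Palermo"]})
encoded = pd.get_dummies(X)

print(list(encoded.columns))  # ['neighborhood_Chacarita', 'neighborhood_Palermo']
```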
In [44]:
model = make_pipeline(OneHotEncoder(use_cat_names=True), SimpleImputer(), LinearRegression())
model.fit(X_train, y_train)
Out[44]:
Pipeline(steps=[('onehotencoder',
                 OneHotEncoder(cols=['neighborhood'], use_cat_names=True)),
                ('simpleimputer', SimpleImputer()),
                ('linearregression', LinearRegression())])
In [45]:
intercept = model.named_steps['linearregression'].intercept_
coefficients = model.named_steps['linearregression'].coef_
print("coefficients len:", len(coefficients))
print(coefficients[:5])  # First five coefficients
coefficients len: 60
[  2208.54913208 247000.81749086 148752.4894247     561.92516784
  13068.13103889]
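A linear model's prediction is simply the intercept plus the dot product of the coefficients with the (encoded) features. A sketch on a tiny synthetic fit of price against area:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny synthetic fit: price vs. area (made-up numbers)
X = np.array([[40.0], [60.0], [80.0]])
y = np.array([100_000.0, 150_000.0, 200_000.0])
model = LinearRegression().fit(X, y)

# Rebuild the first prediction by hand from intercept and coefficients
manual = model.intercept_ + model.coef_ @ X[0]
print(np.isclose(manual, model.predict(X[:1])[0]))  # True
```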

Evaluating¶

Calculating the training mean absolute error for the predictions as compared to the true targets in y_train

In [46]:
# Predicting values
y_pred_training = model.predict(X_train)

# How good is its MAE
print("Training MAE:", round(mean_absolute_error(y_train, y_pred_training), 2))
Training MAE: 24237.29

The model beat the baseline MAE by around 20k, which is a good sign.

Communicating the Results¶

How the results are communicated depends on where the model will be applied. Here, however, I will use an interactive widget where the user can input the information and the predicted price will be presented.

In [47]:
def make_prediction(area, lat, lon, neighborhood):
    
    """
    Input the area, latitude, longitude and neighborhood of a house into the Linear Regression model

    Returns the apartment predicted price

    Parameters
    ----------
    area: float
        The size of the house in square feet
    
    lat: float
        Latitude of the house
        
    lon: float
        Longitude of the house
        
    neighborhood: str
        Neighborhood of the house, where it is located
    """
    
    # Create a dictionary with the received info
    data = {
        "surface_covered_in_m2": area,
        "lat": lat,
        "lon": lon,
        "neighborhood": neighborhood
    }
    
    # Create a DataFrame with the dictionary
    df = pd.DataFrame(data, index=[0])
    
    # Make a prediction
    prediction = model.predict(df).round(2)[0]
    
    # Return the predicted value
    return f"Predicted apartment price: ${prediction}"
In [48]:
# Using the function for the first time
make_prediction(110, -34.60, -58.46, "Villa Crespo")
Out[48]:
'Predicted apartment price: $248868.18'

Creating the interactive widget

In [49]:
# Invoke the interact method
interact(
    make_prediction, # Pass it the predict function
    area=IntSlider( # Start and end of the area slider
        min=X_train["surface_covered_in_m2"].min(),
        max=X_train["surface_covered_in_m2"].max(),
        value=X_train["surface_covered_in_m2"].mean(),
    ),
    lat=FloatSlider( # Start and end of the latitude slider
        min=X_train["lat"].min(),
        max=X_train["lat"].max(),
        step=0.01,
        value=X_train["lat"].mean(),
    ),
    lon=FloatSlider(  # Start and end of the longitude slider
        min=X_train["lon"].min(),
        max=X_train["lon"].max(),
        step=0.01,
        value=X_train["lon"].mean(),
    ),
    neighborhood=Dropdown(options=sorted(X_train["neighborhood"].unique())), # Dropdown with all the neighborhoods
);
interactive(children=(IntSlider(value=53, description='area', max=101, min=30), FloatSlider(value=-34.59890626…

Widget

Feature Importance¶

In [50]:
# Get the model coefficients
coefficients = model.named_steps['linearregression'].coef_
In [51]:
# Get feature names
feature_names = model.named_steps["onehotencoder"].get_feature_names() 

# How many features
print("features len:", len(feature_names)) 
print(feature_names[:5])
features len: 60
['surface_covered_in_m2', 'lat', 'lon', 'neighborhood_Chacarita', 'neighborhood_Villa Luro']
In [52]:
# Transform list into Series
feat_imp = pd.Series(coefficients, index=feature_names) 
feat_imp.head()
Out[52]:
surface_covered_in_m2        2208.549132
lat                        247000.817491
lon                        148752.489425
neighborhood_Chacarita        561.925168
neighborhood_Villa Luro     13068.131039
dtype: float64
In [53]:
# Plot the features with the highest importances
feat_imp.sort_values(key=abs).tail(15).plot(kind='barh')

# Plot and update
plt.xlabel('Importance [USD]')
plt.ylabel('Feature')
plt.title('Feature Importance for Apartment Price')
plt.show()